Until now we've used single comma-delimited and tab-delimited files as our source of data. For this project we'll look at 2,000 individual files, each containing the text of one review. The label is determined by the subdirectory that holds the file: positive reviews are stored in a \pos\ directory, while negative reviews live under \neg\. Refer to moviereviesREADME.txt for more information about the files.
We'll show two different methods to extract the text of each file in each directory and build our labeled corpus: os.walk() and NLTK's CategorizedPlaintextCorpusReader.
In [1]:
# Perform imports:
import numpy as np
import pandas as pd
import os
In [14]:
gen = os.walk('../moviereviews')
next(gen)
Out[14]:
('../moviereviews - Copy', ['neg', 'pos'], ['poldata.README.2.0'])
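os.walk() is a generator that yields a tuple with three items for each directory it visits: the path of the current folder, a list of the subfolders it contains, and a list of the files it contains.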
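As a minimal sketch of that three-item tuple, the following builds a small throwaway directory tree and unpacks each tuple that os.walk() yields. The helper name walk_summary and the folder and file names are made up for illustration; they are not the actual review files.

```python
import os
import tempfile

def walk_summary(root):
    """Collect (relative folder, subfolders, filenames) for every directory under root."""
    summary = []
    for folder, subfolders, filenames in os.walk(root):
        # Sorting in place also makes os.walk visit the subfolders in a predictable order
        subfolders.sort()
        rel = os.path.relpath(folder, root)
        summary.append((rel, list(subfolders), sorted(filenames)))
    return summary

# Hypothetical tree: root/README, root/neg/a.txt, root/pos/b.txt
with tempfile.TemporaryDirectory() as root:
    for sub, name in [('neg', 'a.txt'), ('pos', 'b.txt')]:
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, name), 'w') as f:
            f.write('sample text')
    with open(os.path.join(root, 'README'), 'w') as f:
        f.write('about')
    summary = walk_summary(root)

for row in summary:
    print(row)
```

The top-level folder comes first, followed by one tuple per subfolder, which mirrors the sequence of next(gen) calls in this notebook.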
In [15]:
next(gen)
Out[15]:
('../moviereviews - Copy\\neg',
[],
['cv000_29416.txt',
'cv001_19502.txt',
'cv002_17424.txt',
'cv003_12683.txt',
'cv004_12641.txt',
'cv005_29357.txt',
'cv006_17022.txt',
'cv007_4992.txt',
'cv008_29326.txt',
'cv009_29417.txt',
'cv010_29063.txt',
'cv011_13044.txt',
'cv012_29411.txt',
'cv013_10494.txt',
'cv014_15600.txt',
'cv015_29356.txt',
'cv016_4348.txt',
'cv017_23487.txt',
'cv018_21672.txt',
'cv019_16117.txt',
'cv020_9234.txt',
'cv021_17313.txt',
'cv022_14227.txt',
'cv023_13847.txt',
'cv024_7033.txt',
'cv025_29825.txt',
'cv026_29229.txt',
'cv027_26270.txt',
'cv028_26964.txt',
'cv029_19943.txt',
'cv030_22893.txt',
'cv031_19540.txt',
'cv032_23718.txt',
'cv033_25680.txt',
'cv034_29446.txt',
'cv035_3343.txt',
'cv036_18385.txt',
'cv037_19798.txt',
'cv038_9781.txt',
'cv039_5963.txt',
'cv040_8829.txt',
'cv041_22364.txt',
'cv042_11927.txt',
'cv043_16808.txt',
'cv044_18429.txt',
'cv045_25077.txt',
'cv046_10613.txt',
'cv047_18725.txt',
'cv048_18380.txt',
'cv049_21917.txt',
'cv050_12128.txt',
'cv051_10751.txt',
'cv052_29318.txt',
'cv053_23117.txt',
'cv054_4101.txt',
'cv055_8926.txt',
'cv056_14663.txt',
'cv057_7962.txt',
'cv058_8469.txt',
'cv059_28723.txt',
'cv060_11754.txt',
'cv061_9321.txt',
'cv062_24556.txt',
'cv063_28852.txt',
'cv064_25842.txt',
'cv065_16909.txt',
'cv066_11668.txt',
'cv067_21192.txt',
'cv068_14810.txt',
'cv069_11613.txt',
'cv070_13249.txt',
'cv071_12969.txt',
'cv072_5928.txt',
'cv073_23039.txt',
'cv074_7188.txt',
'cv075_6250.txt',
'cv076_26009.txt',
'cv077_23172.txt',
'cv078_16506.txt',
'cv079_12766.txt',
'cv080_14899.txt',
'cv081_18241.txt',
'cv082_11979.txt',
'cv083_25491.txt',
'cv084_15183.txt',
'cv085_15286.txt',
'cv086_19488.txt',
'cv087_2145.txt',
'cv088_25274.txt',
'cv089_12222.txt',
'cv090_0049.txt',
'cv091_7899.txt',
'cv092_27987.txt',
'cv093_15606.txt',
'cv094_27868.txt',
'cv095_28730.txt',
'cv096_12262.txt',
'cv097_26081.txt',
'cv098_17021.txt',
'cv099_11189.txt',
'cv100_12406.txt',
'cv101_10537.txt',
'cv102_8306.txt',
'cv103_11943.txt',
'cv104_19176.txt',
'cv105_19135.txt',
'cv106_18379.txt',
'cv107_25639.txt',
'cv108_17064.txt',
'cv109_22599.txt',
'cv110_27832.txt',
'cv111_12253.txt',
'cv112_12178.txt',
'cv113_24354.txt',
'cv114_19501.txt',
'cv115_26443.txt',
'cv116_28734.txt',
'cv117_25625.txt',
'cv118_28837.txt',
'cv119_9909.txt',
'cv120_3793.txt',
'cv121_18621.txt',
'cv122_7891.txt',
'cv123_12165.txt',
'cv124_3903.txt',
'cv125_9636.txt',
'cv126_28821.txt',
'cv127_16451.txt',
'cv128_29444.txt',
'cv129_18373.txt',
'cv130_18521.txt',
'cv131_11568.txt',
'cv132_5423.txt',
'cv133_18065.txt',
'cv134_23300.txt',
'cv135_12506.txt',
'cv136_12384.txt',
'cv137_17020.txt',
'cv138_13903.txt',
'cv139_14236.txt',
'cv140_7963.txt',
'cv141_17179.txt',
'cv142_23657.txt',
'cv143_21158.txt',
'cv144_5010.txt',
'cv145_12239.txt',
'cv146_19587.txt',
'cv147_22625.txt',
'cv148_18084.txt',
'cv149_17084.txt',
'cv150_14279.txt',
'cv151_17231.txt',
'cv152_9052.txt',
'cv153_11607.txt',
'cv154_9562.txt',
'cv155_7845.txt',
'cv156_11119.txt',
'cv157_29302.txt',
'cv158_10914.txt',
'cv159_29374.txt',
'cv160_10848.txt',
'cv161_12224.txt',
'cv162_10977.txt',
'cv163_10110.txt',
'cv164_23451.txt',
'cv165_2389.txt',
'cv166_11959.txt',
'cv167_18094.txt',
'cv168_7435.txt',
'cv169_24973.txt',
'cv170_29808.txt',
'cv171_15164.txt',
'cv172_12037.txt',
'cv173_4295.txt',
'cv174_9735.txt',
'cv175_7375.txt',
'cv176_14196.txt',
'cv177_10904.txt',
'cv178_14380.txt',
'cv179_9533.txt',
'cv180_17823.txt',
'cv181_16083.txt',
'cv182_7791.txt',
'cv183_19826.txt',
'cv184_26935.txt',
'cv185_28372.txt',
'cv186_2396.txt',
'cv187_14112.txt',
'cv188_20687.txt',
'cv189_24248.txt',
'cv190_27176.txt',
'cv191_29539.txt',
'cv192_16079.txt',
'cv193_5393.txt',
'cv194_12855.txt',
'cv195_16146.txt',
'cv196_28898.txt',
'cv197_29271.txt',
'cv198_19313.txt',
'cv199_9721.txt',
'cv200_29006.txt',
'cv201_7421.txt',
'cv202_11382.txt',
'cv203_19052.txt',
'cv204_8930.txt',
'cv205_9676.txt',
'cv206_15893.txt',
'cv207_29141.txt',
'cv208_9475.txt',
'cv209_28973.txt',
'cv210_9557.txt',
'cv211_9955.txt',
'cv212_10054.txt',
'cv213_20300.txt',
'cv214_13285.txt',
'cv215_23246.txt',
'cv216_20165.txt',
'cv217_28707.txt',
'cv218_25651.txt',
'cv219_19874.txt',
'cv220_28906.txt',
'cv221_27081.txt',
'cv222_18720.txt',
'cv223_28923.txt',
'cv224_18875.txt',
'cv225_29083.txt',
'cv226_26692.txt',
'cv227_25406.txt',
'cv228_5644.txt',
'cv229_15200.txt',
'cv230_7913.txt',
'cv231_11028.txt',
'cv232_16768.txt',
'cv233_17614.txt',
'cv234_22123.txt',
'cv235_10704.txt',
'cv236_12427.txt',
'cv237_20635.txt',
'cv238_14285.txt',
'cv239_29828.txt',
'cv240_15948.txt',
'cv241_24602.txt',
'cv242_11354.txt',
'cv243_22164.txt',
'cv244_22935.txt',
'cv245_8938.txt',
'cv246_28668.txt',
'cv247_14668.txt',
'cv248_15672.txt',
'cv249_12674.txt',
'cv250_26462.txt',
'cv251_23901.txt',
'cv252_24974.txt',
'cv253_10190.txt',
'cv254_5870.txt',
'cv255_15267.txt',
'cv256_16529.txt',
'cv257_11856.txt',
'cv258_5627.txt',
'cv259_11827.txt',
'cv260_15652.txt',
'cv261_11855.txt',
'cv262_13812.txt',
'cv263_20693.txt',
'cv264_14108.txt',
'cv265_11625.txt',
'cv266_26644.txt',
'cv267_16618.txt',
'cv268_20288.txt',
'cv269_23018.txt',
'cv270_5873.txt',
'cv271_15364.txt',
'cv272_20313.txt',
'cv273_28961.txt',
'cv274_26379.txt',
'cv275_28725.txt',
'cv276_17126.txt',
'cv277_20467.txt',
'cv278_14533.txt',
'cv279_19452.txt',
'cv280_8651.txt',
'cv281_24711.txt',
'cv282_6833.txt',
'cv283_11963.txt',
'cv284_20530.txt',
'cv285_18186.txt',
'cv286_26156.txt',
'cv287_17410.txt',
'cv288_20212.txt',
'cv289_6239.txt',
'cv290_11981.txt',
'cv291_26844.txt',
'cv292_7804.txt',
'cv293_29731.txt',
'cv294_12695.txt',
'cv295_17060.txt',
'cv296_13146.txt',
'cv297_10104.txt',
'cv298_24487.txt',
'cv299_17950.txt',
'cv300_23302.txt',
'cv301_13010.txt',
'cv302_26481.txt',
'cv303_27366.txt',
'cv304_28489.txt',
'cv305_9937.txt',
'cv306_10859.txt',
'cv307_26382.txt',
'cv308_5079.txt',
'cv309_23737.txt',
'cv310_14568.txt',
'cv311_17708.txt',
'cv312_29308.txt',
'cv313_19337.txt',
'cv314_16095.txt',
'cv315_12638.txt',
'cv316_5972.txt',
'cv317_25111.txt',
'cv318_11146.txt',
'cv319_16459.txt',
'cv320_9693.txt',
'cv321_14191.txt',
'cv322_21820.txt',
'cv323_29633.txt',
'cv324_7502.txt',
'cv325_18330.txt',
'cv326_14777.txt',
'cv327_21743.txt',
'cv328_10908.txt',
'cv329_29293.txt',
'cv330_29675.txt',
'cv331_8656.txt',
'cv332_17997.txt',
'cv333_9443.txt',
'cv334_0074.txt',
'cv335_16299.txt',
'cv336_10363.txt',
'cv337_29061.txt',
'cv338_9183.txt',
'cv339_22452.txt',
'cv340_14776.txt',
'cv341_25667.txt',
'cv342_20917.txt',
'cv343_10906.txt',
'cv344_5376.txt',
'cv345_9966.txt',
'cv346_19198.txt',
'cv347_14722.txt',
'cv348_19207.txt',
'cv349_15032.txt',
'cv350_22139.txt',
'cv351_17029.txt',
'cv352_5414.txt',
'cv353_19197.txt',
'cv354_8573.txt',
'cv355_18174.txt',
'cv356_26170.txt',
'cv357_14710.txt',
'cv358_11557.txt',
'cv359_6751.txt',
'cv360_8927.txt',
'cv361_28738.txt',
'cv362_16985.txt',
'cv363_29273.txt',
'cv364_14254.txt',
'cv365_12442.txt',
'cv366_10709.txt',
'cv367_24065.txt',
'cv368_11090.txt',
'cv369_14245.txt',
'cv370_5338.txt',
'cv371_8197.txt',
'cv372_6654.txt',
'cv373_21872.txt',
'cv374_26455.txt',
'cv375_9932.txt',
'cv376_20883.txt',
'cv377_8440.txt',
'cv378_21982.txt',
'cv379_23167.txt',
'cv380_8164.txt',
'cv381_21673.txt',
'cv382_8393.txt',
'cv383_14662.txt',
'cv384_18536.txt',
'cv385_29621.txt',
'cv386_10229.txt',
'cv387_12391.txt',
'cv388_12810.txt',
'cv389_9611.txt',
'cv390_12187.txt',
'cv391_11615.txt',
'cv392_12238.txt',
'cv393_29234.txt',
'cv394_5311.txt',
'cv395_11761.txt',
'cv396_19127.txt',
'cv397_28890.txt',
'cv398_17047.txt',
'cv399_28593.txt',
'cv400_20631.txt',
'cv401_13758.txt',
'cv402_16097.txt',
'cv403_6721.txt',
'cv404_21805.txt',
'cv405_21868.txt',
'cv406_22199.txt',
'cv407_23928.txt',
'cv408_5367.txt',
'cv409_29625.txt',
'cv410_25624.txt',
'cv411_16799.txt',
'cv412_25254.txt',
'cv413_7893.txt',
'cv414_11161.txt',
'cv415_23674.txt',
'cv416_12048.txt',
'cv417_14653.txt',
'cv418_16562.txt',
'cv419_14799.txt',
'cv420_28631.txt',
'cv421_9752.txt',
'cv422_9632.txt',
'cv423_12089.txt',
'cv424_9268.txt',
'cv425_8603.txt',
'cv426_10976.txt',
'cv427_11693.txt',
'cv428_12202.txt',
'cv429_7937.txt',
'cv430_18662.txt',
'cv431_7538.txt',
'cv432_15873.txt',
'cv433_10443.txt',
'cv434_5641.txt',
'cv435_24355.txt',
'cv436_20564.txt',
'cv437_24070.txt',
'cv438_8500.txt',
'cv439_17633.txt',
'cv440_16891.txt',
'cv441_15276.txt',
'cv442_15499.txt',
'cv443_22367.txt',
'cv444_9975.txt',
'cv445_26683.txt',
'cv446_12209.txt',
'cv447_27334.txt',
'cv448_16409.txt',
'cv449_9126.txt',
'cv450_8319.txt',
'cv451_11502.txt',
'cv452_5179.txt',
'cv453_10911.txt',
'cv454_21961.txt',
'cv455_28866.txt',
'cv456_20370.txt',
'cv457_19546.txt',
'cv458_9000.txt',
'cv459_21834.txt',
'cv460_11723.txt',
'cv461_21124.txt',
'cv462_20788.txt',
'cv463_10846.txt',
'cv464_17076.txt',
'cv465_23401.txt',
'cv466_20092.txt',
'cv467_26610.txt',
'cv468_16844.txt',
'cv469_21998.txt',
'cv470_17444.txt',
'cv471_18405.txt',
'cv472_29140.txt',
'cv473_7869.txt',
'cv474_10682.txt',
'cv475_22978.txt',
'cv476_18402.txt',
'cv477_23530.txt',
'cv478_15921.txt',
'cv479_5450.txt',
'cv480_21195.txt',
'cv481_7930.txt',
'cv482_11233.txt',
'cv483_18103.txt',
'cv484_26169.txt',
'cv485_26879.txt',
'cv486_9788.txt',
'cv487_11058.txt',
'cv488_21453.txt',
'cv489_19046.txt',
'cv490_18986.txt',
'cv491_12992.txt',
'cv492_19370.txt',
'cv493_14135.txt',
'cv494_18689.txt',
'cv495_16121.txt',
'cv496_11185.txt',
'cv497_27086.txt',
'cv498_9288.txt',
'cv499_11407.txt',
'cv500_10722.txt',
'cv501_12675.txt',
'cv502_10970.txt',
'cv503_11196.txt',
'cv504_29120.txt',
'cv505_12926.txt',
'cv506_17521.txt',
'cv507_9509.txt',
'cv508_17742.txt',
'cv509_17354.txt',
'cv510_24758.txt',
'cv511_10360.txt',
'cv512_17618.txt',
'cv513_7236.txt',
'cv514_12173.txt',
'cv515_18484.txt',
'cv516_12117.txt',
'cv517_20616.txt',
'cv518_14798.txt',
'cv519_16239.txt',
'cv520_13297.txt',
'cv521_1730.txt',
'cv522_5418.txt',
'cv523_18285.txt',
'cv524_24885.txt',
'cv525_17930.txt',
'cv526_12868.txt',
'cv527_10338.txt',
'cv528_11669.txt',
'cv529_10972.txt',
'cv530_17949.txt',
'cv531_26838.txt',
'cv532_6495.txt',
'cv533_9843.txt',
'cv534_15683.txt',
'cv535_21183.txt',
'cv536_27221.txt',
'cv537_13516.txt',
'cv538_28485.txt',
'cv539_21865.txt',
'cv540_3092.txt',
'cv541_28683.txt',
'cv542_20359.txt',
'cv543_5107.txt',
'cv544_5301.txt',
'cv545_12848.txt',
'cv546_12723.txt',
'cv547_18043.txt',
'cv548_18944.txt',
'cv549_22771.txt',
'cv550_23226.txt',
'cv551_11214.txt',
'cv552_0150.txt',
'cv553_26965.txt',
'cv554_14678.txt',
'cv555_25047.txt',
'cv556_16563.txt',
'cv557_12237.txt',
'cv558_29376.txt',
'cv559_0057.txt',
'cv560_18608.txt',
'cv561_9484.txt',
'cv562_10847.txt',
'cv563_18610.txt',
'cv564_12011.txt',
'cv565_29403.txt',
'cv566_8967.txt',
'cv567_29420.txt',
'cv568_17065.txt',
'cv569_26750.txt',
'cv570_28960.txt',
'cv571_29292.txt',
'cv572_20053.txt',
'cv573_29384.txt',
'cv574_23191.txt',
'cv575_22598.txt',
'cv576_15688.txt',
'cv577_28220.txt',
'cv578_16825.txt',
'cv579_12542.txt',
'cv580_15681.txt',
'cv581_20790.txt',
'cv582_6678.txt',
'cv583_29465.txt',
'cv584_29549.txt',
'cv585_23576.txt',
'cv586_8048.txt',
'cv587_20532.txt',
'cv588_14467.txt',
'cv589_12853.txt',
'cv590_20712.txt',
'cv591_24887.txt',
'cv592_23391.txt',
'cv593_11931.txt',
'cv594_11945.txt',
'cv595_26420.txt',
'cv596_4367.txt',
'cv597_26744.txt',
'cv598_18184.txt',
'cv599_22197.txt',
'cv600_25043.txt',
'cv601_24759.txt',
'cv602_8830.txt',
'cv603_18885.txt',
'cv604_23339.txt',
'cv605_12730.txt',
'cv606_17672.txt',
'cv607_8235.txt',
'cv608_24647.txt',
'cv609_25038.txt',
'cv610_24153.txt',
'cv611_2253.txt',
'cv612_5396.txt',
'cv613_23104.txt',
'cv614_11320.txt',
'cv615_15734.txt',
'cv616_29187.txt',
'cv617_9561.txt',
'cv618_9469.txt',
'cv619_13677.txt',
'cv620_2556.txt',
'cv621_15984.txt',
'cv622_8583.txt',
'cv623_16988.txt',
'cv624_11601.txt',
'cv625_13518.txt',
'cv626_7907.txt',
'cv627_12603.txt',
'cv628_20758.txt',
'cv629_16604.txt',
'cv630_10152.txt',
'cv631_4782.txt',
'cv632_9704.txt',
'cv633_29730.txt',
'cv634_11989.txt',
'cv635_0984.txt',
'cv636_16954.txt',
'cv637_13682.txt',
'cv638_29394.txt',
'cv639_10797.txt',
'cv640_5380.txt',
'cv641_13412.txt',
'cv642_29788.txt',
'cv643_29282.txt',
'cv644_18551.txt',
'cv645_17078.txt',
'cv646_16817.txt',
'cv647_15275.txt',
'cv648_17277.txt',
'cv649_13947.txt',
'cv650_15974.txt',
'cv651_11120.txt',
'cv652_15653.txt',
'cv653_2107.txt',
'cv654_19345.txt',
'cv655_12055.txt',
'cv656_25395.txt',
'cv657_25835.txt',
'cv658_11186.txt',
'cv659_21483.txt',
'cv660_23140.txt',
'cv661_25780.txt',
'cv662_14791.txt',
'cv663_14484.txt',
'cv664_4264.txt',
'cv665_29386.txt',
'cv666_20301.txt',
'cv667_19672.txt',
'cv668_18848.txt',
'cv669_24318.txt',
'cv670_2666.txt',
'cv671_5164.txt',
'cv672_27988.txt',
'cv673_25874.txt',
'cv674_11593.txt',
'cv675_22871.txt',
'cv676_22202.txt',
'cv677_18938.txt',
'cv678_14887.txt',
'cv679_28221.txt',
'cv680_10533.txt',
'cv681_9744.txt',
'cv682_17947.txt',
'cv683_13047.txt',
'cv684_12727.txt',
'cv685_5710.txt',
'cv686_15553.txt',
'cv687_22207.txt',
'cv688_7884.txt',
'cv689_13701.txt',
'cv690_5425.txt',
'cv691_5090.txt',
'cv692_17026.txt',
'cv693_19147.txt',
'cv694_4526.txt',
'cv695_22268.txt',
'cv696_29619.txt',
'cv697_12106.txt',
'cv698_16930.txt',
'cv699_7773.txt',
'cv700_23163.txt',
'cv701_15880.txt',
'cv702_12371.txt',
'cv703_17948.txt',
'cv704_17622.txt',
'cv705_11973.txt',
'cv706_25883.txt',
'cv707_11421.txt',
'cv708_28539.txt',
'cv709_11173.txt',
'cv710_23745.txt',
'cv711_12687.txt',
'cv712_24217.txt',
'cv713_29002.txt',
'cv714_19704.txt',
'cv715_19246.txt',
'cv716_11153.txt',
'cv717_17472.txt',
'cv718_12227.txt',
'cv719_5581.txt',
'cv720_5383.txt',
'cv721_28993.txt',
'cv722_7571.txt',
'cv723_9002.txt',
'cv724_15265.txt',
'cv725_10266.txt',
'cv726_4365.txt',
'cv727_5006.txt',
'cv728_17931.txt',
'cv729_10475.txt',
'cv730_10729.txt',
'cv731_3968.txt',
'cv732_13092.txt',
'cv733_9891.txt',
'cv734_22821.txt',
'cv735_20218.txt',
'cv736_24947.txt',
'cv737_28733.txt',
'cv738_10287.txt',
'cv739_12179.txt',
'cv740_13643.txt',
'cv741_12765.txt',
'cv742_8279.txt',
'cv743_17023.txt',
'cv744_10091.txt',
'cv745_14009.txt',
'cv746_10471.txt',
'cv747_18189.txt',
'cv748_14044.txt',
'cv749_18960.txt',
'cv750_10606.txt',
'cv751_17208.txt',
'cv752_25330.txt',
'cv753_11812.txt',
'cv754_7709.txt',
'cv755_24881.txt',
'cv756_23676.txt',
'cv757_10668.txt',
'cv758_9740.txt',
'cv759_15091.txt',
'cv760_8977.txt',
'cv761_13769.txt',
'cv762_15604.txt',
'cv763_16486.txt',
'cv764_12701.txt',
'cv765_20429.txt',
'cv766_7983.txt',
'cv767_15673.txt',
'cv768_12709.txt',
'cv769_8565.txt',
'cv770_11061.txt',
'cv771_28466.txt',
'cv772_12971.txt',
'cv773_20264.txt',
'cv774_15488.txt',
'cv775_17966.txt',
'cv776_21934.txt',
'cv777_10247.txt',
'cv778_18629.txt',
'cv779_18989.txt',
'cv780_8467.txt',
'cv781_5358.txt',
'cv782_21078.txt',
'cv783_14724.txt',
'cv784_16077.txt',
'cv785_23748.txt',
'cv786_23608.txt',
'cv787_15277.txt',
'cv788_26409.txt',
'cv789_12991.txt',
'cv790_16202.txt',
'cv791_17995.txt',
'cv792_3257.txt',
'cv793_15235.txt',
'cv794_17353.txt',
'cv795_10291.txt',
'cv796_17243.txt',
'cv797_7245.txt',
'cv798_24779.txt',
'cv799_19812.txt',
'cv800_13494.txt',
'cv801_26335.txt',
'cv802_28381.txt',
'cv803_8584.txt',
'cv804_11763.txt',
'cv805_21128.txt',
'cv806_9405.txt',
'cv807_23024.txt',
'cv808_13773.txt',
'cv809_5012.txt',
'cv810_13660.txt',
'cv811_22646.txt',
'cv812_19051.txt',
'cv813_6649.txt',
'cv814_20316.txt',
'cv815_23466.txt',
'cv816_15257.txt',
'cv817_3675.txt',
'cv818_10698.txt',
'cv819_9567.txt',
'cv820_24157.txt',
'cv821_29283.txt',
'cv822_21545.txt',
'cv823_17055.txt',
'cv824_9335.txt',
'cv825_5168.txt',
'cv826_12761.txt',
'cv827_19479.txt',
'cv828_21392.txt',
'cv829_21725.txt',
'cv830_5778.txt',
'cv831_16325.txt',
'cv832_24713.txt',
'cv833_11961.txt',
'cv834_23192.txt',
'cv835_20531.txt',
'cv836_14311.txt',
'cv837_27232.txt',
'cv838_25886.txt',
'cv839_22807.txt',
'cv840_18033.txt',
'cv841_3367.txt',
'cv842_5702.txt',
'cv843_17054.txt',
'cv844_13890.txt',
'cv845_15886.txt',
'cv846_29359.txt',
'cv847_20855.txt',
'cv848_10061.txt',
'cv849_17215.txt',
'cv850_18185.txt',
'cv851_21895.txt',
'cv852_27512.txt',
'cv853_29119.txt',
'cv854_18955.txt',
'cv855_22134.txt',
'cv856_28882.txt',
'cv857_17527.txt',
'cv858_20266.txt',
'cv859_15689.txt',
'cv860_15520.txt',
'cv861_12809.txt',
'cv862_15924.txt',
'cv863_7912.txt',
'cv864_3087.txt',
'cv865_28796.txt',
'cv866_29447.txt',
'cv867_18362.txt',
'cv868_12799.txt',
'cv869_24782.txt',
'cv870_18090.txt',
'cv871_25971.txt',
'cv872_13710.txt',
'cv873_19937.txt',
'cv874_12182.txt',
'cv875_5622.txt',
'cv876_9633.txt',
'cv877_29132.txt',
'cv878_17204.txt',
'cv879_16585.txt',
'cv880_29629.txt',
'cv881_14767.txt',
'cv882_10042.txt',
'cv883_27621.txt',
'cv884_15230.txt',
'cv885_13390.txt',
'cv886_19210.txt',
'cv887_5306.txt',
'cv888_25678.txt',
'cv889_22670.txt',
'cv890_3515.txt',
'cv891_6035.txt',
'cv892_18788.txt',
'cv893_26731.txt',
'cv894_22140.txt',
'cv895_22200.txt',
'cv896_17819.txt',
'cv897_11703.txt',
'cv898_1576.txt',
'cv899_17812.txt',
'cv900_10800.txt',
'cv901_11934.txt',
'cv902_13217.txt',
'cv903_18981.txt',
'cv904_25663.txt',
'cv905_28965.txt',
'cv906_12332.txt',
'cv907_3193.txt',
'cv908_17779.txt',
'cv909_9973.txt',
'cv910_21930.txt',
'cv911_21695.txt',
'cv912_5562.txt',
'cv913_29127.txt',
'cv914_2856.txt',
'cv915_9342.txt',
'cv916_17034.txt',
'cv917_29484.txt',
'cv918_27080.txt',
'cv919_18155.txt',
'cv920_29423.txt',
'cv921_13988.txt',
'cv922_10185.txt',
'cv923_11951.txt',
'cv924_29397.txt',
'cv925_9459.txt',
'cv926_18471.txt',
'cv927_11471.txt',
'cv928_9478.txt',
'cv929_1841.txt',
'cv930_14949.txt',
'cv931_18783.txt',
'cv932_14854.txt',
'cv933_24953.txt',
'cv934_20426.txt',
'cv935_24977.txt',
'cv936_17473.txt',
'cv937_9816.txt',
'cv938_10706.txt',
'cv939_11247.txt',
'cv940_18935.txt',
'cv941_10718.txt',
'cv942_18509.txt',
'cv943_23547.txt',
'cv944_15042.txt',
'cv945_13012.txt',
'cv946_20084.txt',
'cv947_11316.txt',
'cv948_25870.txt',
'cv949_21565.txt',
'cv950_13478.txt',
'cv951_11816.txt',
'cv952_26375.txt',
'cv953_7078.txt',
'cv954_19932.txt',
'cv955_26154.txt',
'cv956_12547.txt',
'cv957_9059.txt',
'cv958_13020.txt',
'cv959_16218.txt',
'cv960_28877.txt',
'cv961_5578.txt',
'cv962_9813.txt',
'cv963_7208.txt',
'cv964_5794.txt',
'cv965_26688.txt',
'cv966_28671.txt',
'cv967_5626.txt',
'cv968_25413.txt',
'cv969_14760.txt',
'cv970_19532.txt',
'cv971_11790.txt',
'cv972_26837.txt',
'cv973_10171.txt',
'cv974_24303.txt',
'cv975_11920.txt',
'cv976_10724.txt',
'cv977_4776.txt',
'cv978_22192.txt',
'cv979_2029.txt',
'cv980_11851.txt',
'cv981_16679.txt',
'cv982_22209.txt',
'cv983_24219.txt',
'cv984_14006.txt',
'cv985_5964.txt',
'cv986_15092.txt',
'cv987_7394.txt',
'cv988_20168.txt',
'cv989_17297.txt',
'cv990_12443.txt',
'cv991_19973.txt',
'cv992_12806.txt',
'cv993_29565.txt',
'cv994_13229.txt',
'cv995_23113.txt',
'cv996_12447.txt',
'cv997_5152.txt',
'cv998_15691.txt',
'cv999_14636.txt'])
The subfolder ../moviereviews/neg contains 1000 text files.
In [16]:
next(gen) # this walks the /pos/ subfolder
next(gen)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-16-e2a758a6db89> in <module>()
1 next(gen) # this walks the /pos/ subfolder
----> 2 next(gen)
StopIteration:
os.walk() raised StopIteration once it had walked all subfolders.
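A for loop handles StopIteration automatically. When calling next() by hand, passing a default value avoids the traceback. A small sketch, using a throwaway empty directory rather than the review folders:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    gen = os.walk(root)
    first = next(gen, None)   # the tuple for the root directory itself
    second = next(gen, None)  # nothing left to walk, so the default is returned

print(first is not None)
print(second)
```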
The most efficient way to build a DataFrame from individual text files is to first build a list of dictionaries, then cast the list as a DataFrame all at once.
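To illustrate the idea on its own, here is a small sketch with two hypothetical rows (the name df_demo and the review text are invented for the example). The second dictionary has no review key, which pandas fills in as NaN.

```python
import pandas as pd

# Hypothetical rows; the second dictionary has no 'review' key
rows = [
    {'label': 'neg', 'review': 'a dull film'},
    {'label': 'pos'},
]

df_demo = pd.DataFrame(rows)  # a single constructor call for the whole list
print(df_demo)
print(df_demo['review'].isnull().sum())  # number of rows with a missing review
```

Appending to a plain Python list is cheap, whereas growing a DataFrame one row at a time copies the data on each step, which is why the list is built first.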
We'll build our list by walking each subdirectory and creating one dictionary per file, in which label is either 'neg' or 'pos', and review is the text of the file.
In [20]:
row_list = []

for subdir in ['neg', 'pos']:
    for folder, subfolders, filenames in os.walk('../moviereviews/' + subdir):
        for file in filenames:
            d = {'label': subdir}  # assign the name of the subdirectory to the label field
            # build the path from the folder being walked so it matches the os.walk() root
            with open(os.path.join(folder, file)) as f:
                text = f.read()
                if text:  # empty files get no review field, which becomes NaN in the DataFrame
                    d['review'] = text  # assign the contents of the file to the review field
            row_list.append(d)
        break  # only read the files directly inside each subdirectory
In [21]:
df = pd.DataFrame(row_list)
In [22]:
df.head()
Out[22]:
   label  review
0  neg    NaN
1  neg    the happy bastard's quick movie review \ndamn ...
2  neg    it is movies like these that make a jaded movi...
3  neg    " quest for camelot " is warner bros . ' firs...
4  neg    synopsis : a mentally unstable man undergoing ...
Content source: rishuatgithub/MLPy